Abstract



We give a unique string representation, up to isomorphism, for initially connected deterministic finite automata (ICDFA's) with $n$ states over an alphabet of $k$ symbols. We show how to generate all these strings for each $n$ and $k$, and how their enumeration provides an alternative way to obtain the exact number of ICDFA's.



1 Motivation 



In symbolic manipulation environments for finite automata, it is important to have an adequate representation of automata and, depending on their use, several representations may be available. For example, for testing whether two finite automata are isomorphic or for (random) generation of automata, the representation must be compact and somehow canonical. In the FAdo project [MR05a] [fad] a canonical form is used to test whether two minimal DFA's are isomorphic (i.e., are the same up to renaming of states). In this paper we prove the correctness of that representation and show how it can be used for the exact enumeration and generation of initially connected deterministic finite automata (ICDFA's).

The problem of enumerating finite automata has been considered by several authors since the early 1960s; see in particular Robinson [Rob85], Harary and Palmer [HP67] and Liskovets [Lis69], amongst many others. A survey may be found in Domaratzki et al. [DKS02]. More recently, several authors examined related problems. Domaratzki et al. [DKS02] studied the enumeration of distinct languages accepted by finite automata with $n$ states; Nicaud [Nic99], Champarnaud and Paranthoen [CP05] [Par04] and Bassino and Nicaud [BN] analysed several aspects of the average behaviour of regular languages; Liskovets [Lis03] and Domaratzki [Dom04] gave (exact and asymptotic) enumerations of acyclic DFA's and of finite languages.

The paper is organised as follows. In the next section, we review some basic notions and introduce some notation. Section 3 describes a string representation for deterministic finite automata that is unique up to isomorphism for initially connected deterministic finite automata. Section 4 presents an efficient method to generate those strings. Section 5 shows how their enumeration provides an upper bound and the exact value for the number of ICDFA's. Section 6 concludes with some final remarks. We refer the reader to the longer version of this paper for some implementation issues and experimental results^1.



2 Preliminaries 

We first recall some basic notions from automata theory and formal languages, which can be found in standard books [HMU00]. An alphabet $\Sigma$ is a nonempty set of symbols. A string over $\Sigma$ is a finite sequence of symbols of $\Sigma$. The empty string is denoted by $\epsilon$. The set $\Sigma^*$ is the set of all strings over $\Sigma$. A language $L$ is a subset of $\Sigma^*$. The density of a language $L$ over $\Sigma$, $\rho_L(n)$, is the number of strings of length $n$ that are in $L$, i.e., $\rho_L(n) = |L \cap \Sigma^n|$. If $L_1, L_2 \subseteq \Sigma^*$, then $L_1L_2 = \{xy \mid x \in L_1 \text{ and } y \in L_2\}$. A regular expression (r.e.) $\alpha$ over $\Sigma$ represents a language $L(\alpha) \subseteq \Sigma^*$ and is inductively defined by: $\emptyset$, $\epsilon$ and $\sigma \in \Sigma$ are r.e., where $L(\emptyset) = \emptyset$, $L(\epsilon) = \{\epsilon\}$ and $L(\sigma) = \{\sigma\}$; if $\alpha_1$ and $\alpha_2$ are r.e., then $(\alpha_1 + \alpha_2)$, $(\alpha_1\alpha_2)$ and $\alpha_1^*$ are r.e., respectively with $L((\alpha_1 + \alpha_2)) = L(\alpha_1) \cup L(\alpha_2)$, $L((\alpha_1\alpha_2)) = L(\alpha_1)L(\alpha_2)$ and $L(\alpha_1^*) = L(\alpha_1)^*$. In this paper, we will use regular expressions to represent descriptions of finite automata.

A deterministic finite automaton (DFA) $\mathcal{A}$ is a quintuple $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$ is the alphabet, $\delta : Q \times \Sigma \to Q$ is the transition function, $q_0$ the initial state and $F \subseteq Q$ the set of final states. We assume that the transition function is total, so we consider only complete DFA's. The size of a DFA is the number of its states, $|Q|$. Normally, we are not interested in the labels of the states and we can represent them by an integer $0 \le i < |Q|$. The transition function $\delta$ extends naturally to $\Sigma^*$: for all $q \in Q$, if $x = \epsilon$ then $\delta(q, \epsilon) = q$; if $x = y\sigma$ then $\delta(q, x) = \delta(\delta(q, y), \sigma)$. A DFA is initially connected^2 (ICDFA) if for each state $q \in Q$ there exists a string $x \in \Sigma^*$ such that $\delta(q_0, x) = q$.

Two DFA's $\mathcal{A} = (Q, \Sigma, \delta, q_0, F)$ and $\mathcal{A}' = (Q', \Sigma, \delta', q'_0, F')$ are called isomorphic (by states) if there exists a bijection $f : Q \to Q'$ such that $f(q_0) = q'_0$ and for all $\sigma \in \Sigma$ and $q \in Q$, $f(\delta(q, \sigma)) = \delta'(f(q), \sigma)$. Furthermore, for all $q \in Q$, $q \in F$ if and only if $f(q) \in F'$. The language accepted by a DFA $\mathcal{A}$ is $L(\mathcal{A}) = \{x \in \Sigma^* \mid \delta(q_0, x) \in F\}$. Two DFA's are equivalent if they accept the same language. Obviously, two isomorphic automata are equivalent, but two non-isomorphic automata may be equivalent. A DFA $\mathcal{A}$ is minimal if there is no DFA $\mathcal{A}'$ with fewer states equivalent to $\mathcal{A}$. Trivially, a minimal DFA is an ICDFA. Minimal DFA's are unique up to isomorphism.

We are mainly concerned with the representation of the transition function of DFA's, so we disregard the set of final states and consider only a quadruple $(Q, \Sigma, \delta, q_0)$, called the structure of an automaton and referred to as a DFA$_\emptyset$. For each of these structures there are $2^n$ DFA's, one for each choice of the set of final states. We denote by ICDFA$_\emptyset$ the structure of an ICDFA. Unless otherwise stated, every integer variable is assumed to have a nonnegative value. Let $[n]_0 = \{0, 1, \ldots, n\}$ and $[n] = \{1, \ldots, n\}$.

^1 http://www.dcc.fc.up.pt/Pubs/TR05/dcc-2005-04.ps.gz
^2 Also called accessible.

3 Representations towards a normal form

The method used to represent a DFA plays a significant role in the amount of computer work needed to manipulate that information, and can give important insight into this set of objects, both in its characterisation and enumeration. Let us disregard the set of final states of a DFA. A naive representation of a DFA$_\emptyset$ can be obtained by enumerating its states and, for each state, listing its transitions for each symbol. For the DFA$_\emptyset$ in Fig. 1 we have:

[[A (a : A, b : B)], [B (a : A, b : E)], [C (a : B, b : E)], [D (a : D, b : C)], [E (a : A, b : E)]]. (1)




Figure 1: A DFA with no final states marked 



Given a complete DFA$_\emptyset$ $(Q, \Sigma, \delta, q_0)$ with $|Q| = n$ and $|\Sigma| = k$, and considering a total order over $\Sigma$, the representation can be simplified by omitting the alphabetic symbols. For our example, we would have

[[A (A, B)], [B (A, E)], [C (B, E)], [D (D, C)], [E (A, E)]]. (2)



The labels chosen for the states have a standard order (in the example, the alphabetic order). We can simplify the representation a bit if we use that order to identify the states, and because we are representing complete DFA$_\emptyset$'s we can drop the inner tuples as well. We obtain

[0,1,0,4,1,4,3,2,0,4]. (3) 

Because this representation depends on the order in which we label the states, we have more than one representation for each DFA$_\emptyset$. Can we have a canonical order for the set of states? Let the first state be the initial state $q_0$ of the automaton, the second state the first one to be referred to (excepting $q_0$) by a transition from $q_0$, the third state the next one referred to in transitions from one of the first two states, and so on. For the DFA$_\emptyset$ in the example, this method induces a unique order for the first three states (A, B, E), but the order of the remaining states (C, D) is arbitrary. Two different representations are thus admissible:

[0,1,0,2,0,2,3,4,1,2] and [0,1,0,2,0,2,1,2,4,3]. (4)



If we restrict this representation to ICDFA$_\emptyset$'s, then this representation is unique and defines an order over the set of its states. In the example, the DFA$_\emptyset$ restricted to the set of states {A, B, E} is represented by [0,1,0,2,0,2]. Let $\Sigma = \{\sigma_i \mid i < k\}$, with $\sigma_0 < \sigma_1 < \cdots < \sigma_{k-1}$. Given an ICDFA$_\emptyset$ $(Q, \Sigma, \delta, q_0)$ with $|Q| = n$, the representing string is of the form $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ and $S_i = \delta(\lfloor i/k \rfloor, \sigma_{i \bmod k})$.
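The canonical ordering just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `canonical_string`, the dictionary encoding of the transition function and the constant `FIG1` are hypothetical, and the symbols a and b are encoded as 0 and 1. States are numbered in the order in which they are first referred to, and only the part reachable from the initial state is described.

```python
def canonical_string(delta, q0, k):
    """Return the representing string [S_0, ..., S_{kn-1}] of the part of a
    complete DFA that is reachable from q0.

    delta maps (state, symbol_index) to a state, with symbol_index in range(k).
    """
    label = {q0: 0}   # state -> canonical number, in order of first reference
    order = [q0]      # canonical number -> state
    string = []
    pos = 0
    while pos < len(order):            # describe states in canonical order
        state = order[pos]
        for a in range(k):
            target = delta[(state, a)]
            if target not in label:    # first reference: give the next number
                label[target] = len(order)
                order.append(target)
            string.append(label[target])
        pos += 1
    return string


# The DFA of Fig. 1 (assumed encoding: a -> 0, b -> 1).
FIG1 = {
    ("A", 0): "A", ("A", 1): "B",
    ("B", 0): "A", ("B", 1): "E",
    ("C", 0): "B", ("C", 1): "E",
    ("D", 0): "D", ("D", 1): "C",
    ("E", 0): "A", ("E", 1): "E",
}

print(canonical_string(FIG1, "A", 2))  # [0, 1, 0, 2, 0, 2]
```

States C and D are not reachable from A, so they do not appear in the result, which matches the string given for the restriction to {A, B, E}.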
Lemma 1. Let $[(S_i)_{i<kn}]$ be a representation of a complete ICDFA$_\emptyset$ $(Q, \Sigma, \delta, q_0)$ with $|Q| = n$ and $|\Sigma| = k$, then:

$$(\forall m > 1)(\forall i)(S_i = m \Rightarrow (\exists j < i)\, S_j = m - 1) \tag{R1}$$
$$(\forall m \in [n-1])((\exists j < km)\, S_j = m) \tag{R2}$$

Proof. The condition R1 establishes that a state label (greater than 1) can only occur after the occurrence of its predecessors. This is a direct consequence of the way we defined the representing string. Suppose R2 does not hold, so that there exists a state $m$ that does not occur in the first $km$ symbols of the string (the first $m$ state descriptions). Because the automaton is initially connected there must be a sequence of states $(m_i)_{i \le l}$ and symbols $(\sigma_i)_{i<l}$ such that $m_0 = 0$, $m_l = m$ and $\delta(m_i, \sigma_i) = m_{i+1}$ for $i < l$. We must have $0 < m < m_{l-1}$ because $m$ appears in the description of $m_{l-1}$ and we supposed no occurrences of $m$ in the first $m$ state descriptions. There must exist $l' < l$ such that $m_{l'-1} < m < m_{l'}$, implying that $m_{l'} \in \{S_i \mid i < km\}$. This contradicts R1 because we are supposing that $m \notin \{S_i \mid i < km\}$ and $m < m_{l'}$. Thus R2 holds. □

Note that the conditions R1 and R2 are independent. For $k = 2$ and $n = 3$, the string [2, 1, 0, 0, 1, 0] satisfies R2 but not R1, and the opposite occurs for the string [0, 0, 1, 1, 0, 2].
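The two conditions are easy to check mechanically. The following is a minimal sketch, assuming strings are given as Python lists of integers; the function names `satisfies_r1` and `satisfies_r2` are hypothetical. It reproduces the two examples above.

```python
def satisfies_r1(s):
    """R1: a label m > 1 may occur only after m - 1 has already occurred."""
    seen = set()
    for label in s:
        if label > 1 and (label - 1) not in seen:
            return False
        seen.add(label)
    return True


def satisfies_r2(s, n, k):
    """R2: every label m in 1..n-1 occurs among the first k*m symbols."""
    return all(m in s[:k * m] for m in range(1, n))


# The two strings used in the text, for k = 2 and n = 3.
print(satisfies_r1([2, 1, 0, 0, 1, 0]), satisfies_r2([2, 1, 0, 0, 1, 0], 3, 2))  # False True
print(satisfies_r1([0, 0, 1, 1, 0, 2]), satisfies_r2([0, 0, 1, 1, 0, 2], 3, 2))  # True False
```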

Lemma 2. Every string $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ satisfying R1 and R2 represents a complete ICDFA$_\emptyset$ with $n$ states over an alphabet of $k$ symbols.

Proof. Let $[(S_i)_{i<kn}]$ be a string satisfying these conditions, and consider the associated automaton $\mathcal{A}$ using the string symbols as labels for the corresponding states. By its construction, $\mathcal{A}$ is a DFA$_\emptyset$. We only need to prove that it is initially connected. Let $m$ be a state of the automaton. A proof that $m$ is reachable from the initial state can be done by induction on $m$. If $m = 0$ there is nothing to prove. If $m = 1$ then, by R2, 1 must occur in the description of state 0, making state 1 reachable from state 0. Let us suppose that every state $m' < m$ is reachable from state 0 and prove that state $m$ is reachable too. By R2, $m$ occurs at least once before position $km$, say in position $km' + i$ with $m' < m$ and $i < k$. Then for some symbol $\sigma$, $\delta(m', \sigma) = m$. By induction hypothesis, state $m'$ is reachable from state 0, thus state $m$ is reachable too and the automaton is initially connected.

Now consider the string representation obtained for $\mathcal{A}$, $[(S'_i)_{i<kn}]$. By Lemma 1 it satisfies R1 and R2. It is easy to see that this representation is the same as $[(S_i)_{i<kn}]$. By R1, $S_0 = S'_0$. Suppose that $(\forall i < j)(S_i = S'_i)$. Now we prove that $S'_j = S_j$. By R1, either $S_j \in \{S_i \mid i < j\}$ or $S_j = \max\{S_i \mid i < j\} + 1$. In the first case, there exists $l < j$ such that $S_j = S_l$ and, by induction hypothesis, $S_l = S'_l$, thus $S'_j = S_j$. Analogously, by R1, in the second case we have that

$$S'_j = \max\{S'_i \mid i < j\} + 1 = S_j.$$

□

Theorem 1. There is a one-to-one mapping between the strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ satisfying R1 and R2, and the non-isomorphic ICDFA$_\emptyset$'s with $n$ states, over an alphabet $\Sigma$ of size $k$.

Proof. Let $(Q, \Sigma, \delta, q_0)$ and $(Q', \Sigma, \delta', q'_0)$ be two ICDFA$_\emptyset$'s and $[(S_i)_{i<kn}]$ and $[(S'_i)_{i<kn}]$ their representing strings. By Lemma 1 these strings satisfy R1 and R2. Suppose that $f : Q \to Q'$ is an isomorphism between the ICDFA$_\emptyset$'s. Then $0 = q_0$ and $f(q_0) = q'_0 = 0$. Either $S_0 = \delta(q_0, \sigma_0) = q_0 = 0$ or $S_0 = \delta(q_0, \sigma_0) = 1$ (by R1).

i) If $S_0 = 0$ then $f(q_0) = f(\delta(q_0, \sigma_0)) = \delta'(q'_0, \sigma_0) = S'_0 = 0$, because $\delta(q_0, \sigma_0) = q_0$ implies $\delta'(f(q_0), \sigma_0) = f(q_0)$.

ii) If $S_0 = 1$ then $f(1) = \delta'(q'_0, \sigma_0) = S'_0 \neq 0$, thus $S'_0 = 1$, again by R1.

Supposing that $(\forall i < j)(S_i = S'_i \wedge f(S_i) = S'_i)$, we need to prove that $S_j = S'_j \wedge f(S_j) = S'_j$. Trivially we have $\{S_i \mid i < j\} = \{S'_i \mid i < j\}$. We know that $S_j = \delta(\lfloor j/k \rfloor, \sigma_{j \bmod k})$, and by R2 there exists $l < j$ such that $\lfloor j/k \rfloor = S_l$, thus $f(\lfloor j/k \rfloor) = f(S_l) = S'_l = S_l = \lfloor j/k \rfloor$ by induction hypothesis. We have

$$S'_j = \delta'(\lfloor j/k \rfloor, \sigma_{j \bmod k}) = \delta'(f(\lfloor j/k \rfloor), \sigma_{j \bmod k}) = f(\delta(\lfloor j/k \rfloor, \sigma_{j \bmod k})) = f(S_j).$$

By R1, either $S_j \in \{S_i \mid i < j\}$ or $S_j = \max\{S_i \mid i < j\} + 1$.

i) If $S_j \in \{S_i \mid i < j\}$ then there exists $l < j$ such that $S_j = S_l$ and $S_l = S'_l$. Then

$$\delta(\lfloor j/k \rfloor, \sigma_{j \bmod k}) = \delta(\lfloor l/k \rfloor, \sigma_{l \bmod k}) \Leftrightarrow f(\delta(\lfloor j/k \rfloor, \sigma_{j \bmod k})) = f(\delta(\lfloor l/k \rfloor, \sigma_{l \bmod k})) \Leftrightarrow \delta'(\lfloor j/k \rfloor, \sigma_{j \bmod k}) = \delta'(\lfloor l/k \rfloor, \sigma_{l \bmod k}).$$

Thus $S_j = S_l$ implies $S'_j = S'_l$ and so $S'_j = S_j$.

ii) If $S_j = \max\{S_i \mid i < j\} + 1$ then $S'_j \notin \{S'_i \mid i < j\}$, because if there existed $l < j$ such that $S'_l = S'_j$ then, by the same reasoning as before, $S_j \in \{S_i \mid i < j\}$. Thus, by R1, $S'_j = \max\{S'_i \mid i < j\} + 1 = S_j$.

Conversely, by Lemma 2 each string represents an ICDFA$_\emptyset$ up to a compatible renaming of states, i.e., if two ICDFA$_\emptyset$'s are represented by the same string, that representation defines an isomorphism between them. □

These string representations lead to a normal representation for ICDFA$_\emptyset$'s. For each of them, if we add a sequence of final states, we obtain a normal form for ICDFA's.
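As an illustration of how such a normal form can serve as an isomorphism test, here is a small Python sketch. It is not the FAdo implementation: the names `normal_form` and `isomorphic`, the dictionary encoding of $\delta$, and the encoding of the final states as a tuple of 0/1 flags in canonical state order are all assumptions made for the example. The comparison is only meaningful for initially connected automata, since unreachable states are ignored.

```python
def normal_form(delta, q0, finals, k):
    """Return (representing string, final-state flags) for the part of a
    complete DFA reachable from q0.

    delta maps (state, symbol_index) to a state; finals is a set of states.
    """
    label = {q0: 0}
    order = [q0]
    string = []
    pos = 0
    while pos < len(order):
        state = order[pos]
        for a in range(k):
            target = delta[(state, a)]
            if target not in label:
                label[target] = len(order)
                order.append(target)
            string.append(label[target])
        pos += 1
    flags = tuple(int(q in finals) for q in order)  # finality in canonical order
    return tuple(string), flags


def isomorphic(dfa1, dfa2, k):
    """Compare two initially connected DFAs given as (delta, q0, finals)."""
    return normal_form(*dfa1, k) == normal_form(*dfa2, k)


# Two copies of the same 3-state automaton with different state names.
D1 = ({("p", 0): "q", ("p", 1): "p", ("q", 0): "r", ("q", 1): "p",
       ("r", 0): "r", ("r", 1): "q"}, "p", {"r"})
D2 = ({(2, 0): 0, (2, 1): 2, (0, 0): 1, (0, 1): 2,
       (1, 0): 1, (1, 1): 0}, 2, {1})
# Same structure as D1 but a different set of final states.
D3 = (D1[0], "p", {"q"})

print(isomorphic(D1, D2, 2))  # True
print(isomorphic(D1, D3, 2))  # False
```

The test takes time linear in $kn$ per automaton, because each transition is visited exactly once.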

4 Generating automata 

Normal representations for ICDFA$_\emptyset$'s (as presented above) can be used as compact computer representations for that kind of object, but even though rules R1 and R2 are quite simple, it is not evident how to write an efficient enumerative algorithm. In a string $[(S_i)_{i<kn}]$ representing an ICDFA$_\emptyset$ with $n$ states over an alphabet of $k$ symbols, let $(f_j)_{0<j<n}$ be the sequence of indexes of the first occurrence of each state label $j$. That those indexes exist is a direct consequence of the way the string is constructed. Now consider $b_1 = f_1$, $b_j = f_j - f_{j-1}$ for $2 \le j \le n-1$, and $b_n = kn - f_{n-1} + 1$. Note that

$$\sum_{i=1}^{j} b_i = f_j, \quad \text{for } j \in [n-1].$$

It is easy to see that

1. Rule R1 simply states that

$$(\forall\, 2 \le j \le n-1)(b_j > 0). \tag{G1}$$

2. Rule R2 establishes that

$$(\forall m \in [n-1])(f_m < km). \tag{G2}$$

To generate all the automata, for each allowed sequence $(b_j)_{0<j<n}$ we can generate all the remaining symbols $S_i$ (those with $i \notin \{f_j \mid 0 < j < n\}$) according to the following rules:

$$i < b_1 \Rightarrow S_i = 0; \tag{G3}$$
$$(\forall j \in [n-2])(f_j < i < f_{j+1} \Rightarrow S_i \in [j]_0); \tag{G4}$$
$$i > f_{n-1} \Rightarrow S_i \in [n-1]_0. \tag{G5}$$
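The generation scheme can be sketched as follows. This is an illustrative reading of the rules, not the program described in this paper: it works directly with 0-based first-occurrence positions $f_1 < f_2 < \cdots < f_{n-1}$ (so G1 becomes $f_j > f_{j-1}$ and G2 becomes $f_m < km$) rather than with the differences $b_j$, and the names `flag_sequences` and `icdfa_strings` are hypothetical. No claim is made about the order in which the strings are produced.

```python
from itertools import product


def flag_sequences(n, k):
    """Yield tuples (f_1, ..., f_{n-1}) of 0-based first-occurrence positions
    that are strictly increasing (G1) and satisfy f_m < k*m (G2)."""
    def extend(prefix):
        m = len(prefix) + 1              # label whose flag is chosen next
        if m == n:
            yield tuple(prefix)
            return
        start = prefix[-1] + 1 if prefix else 0
        for f in range(start, k * m):
            yield from extend(prefix + [f])
    yield from extend([])


def icdfa_strings(n, k):
    """Yield every string of length k*n that satisfies R1 and R2."""
    for flags in flag_sequences(n, k):
        choices = []                     # choices[i]: labels allowed at position i
        placed = 0                       # number of flags already passed
        for i in range(k * n):
            if placed < len(flags) and i == flags[placed]:
                placed += 1
                choices.append([placed])                  # first occurrence of 'placed'
            else:
                choices.append(list(range(placed + 1)))   # G3, G4, G5: labels 0..placed
        for s in product(*choices):
            yield list(s)


print(sum(1 for _ in icdfa_strings(3, 2)))  # 216
```

Every string is produced exactly once, because a string determines its own first-occurrence positions; nothing is generated and then discarded.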

5 Enumeration of ICDFA's 

In this section we obtain a formula $B_k(n)$ for the number of strings $[(S_i)_{i<kn}]$ representing ICDFA$_\emptyset$'s with $n$ states over an alphabet of $k$ symbols. Although a formula for the number of non-isomorphic ICDFA$_\emptyset$'s is already known, we think that our method is new. Liskovets [Lis69] and, independently, Robinson [Rob85] gave for that number the formula $H_k(n) = \frac{h_k(n)}{(n-1)!}$, where $h_k(1) = 1$ and for $n > 1$

$$h_k(n) = n^{kn} - \sum_{1 \le j < n} \binom{n-1}{j-1}\, n^{k(n-j)}\, h_k(j). \tag{5}$$

Note that $n^{kn}$ is the number of transition functions, from which we subtract the number of them that have $n-1, n-2, \ldots, 1$ states not accessible from the initial state. We may then divide by $(n-1)!$, as the names of the remaining states (except the initial one) are irrelevant. By contrast, the formula we will derive, $B_k(n)$, is a direct positive summation.
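The subtractive recurrence is straightforward to evaluate. The sketch below assumes the usual form of the recurrence, $h_k(n) = n^{kn} - \sum_{1 \le j < n} \binom{n-1}{j-1} n^{k(n-j)} h_k(j)$; the function name `H` is chosen here to mirror the notation and is otherwise hypothetical.

```python
from functools import lru_cache
from math import comb, factorial


def H(k, n):
    """Number of non-isomorphic initially connected structures with n states
    over k symbols, via the subtractive recurrence for h_k."""
    @lru_cache(maxsize=None)
    def h(m):
        if m == 1:
            return 1
        # all transition functions on m states, minus those in which only
        # j < m states (including the initial one) are accessible
        return m ** (k * m) - sum(
            comb(m - 1, j - 1) * m ** (k * (m - j)) * h(j) for j in range(1, m)
        )
    return h(n) // factorial(n - 1)


print([H(2, n) for n in range(1, 5)])  # [1, 12, 216, 5248]
```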

First, let us consider the set of strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ and satisfying only rule R1. The number of these strings gives an upper bound for $B_k(n)$. This set can be given by $A_n \cap [n-1]_0^{kn}$, where for $c > 0$,

$$A_c = L\Big(0^* + \sum_{i=1}^{c-1} 0^* \prod_{j=1}^{i} j\,(0 + \cdots + j)^*\Big). \tag{6}$$

These languages belong to a family of languages presented by Moreira and Reis [MR05b] that represent partitions of $[n]$ with no more than $c \ge 1$ parts, i.e.,

$$L_c = L\Big(\sum_{i=1}^{c} \prod_{j=1}^{i} j\,(1 + \cdots + j)^*\Big). \tag{7}$$

We have that $\rho_{A_c}(n) = \rho_{L_c}(n+1)$ and that $\rho_{L_c}(n) = \sum_{i=1}^{c} S(n, i)$, where $S(n, i)$ are Stirling numbers of the second kind. So the number of strings of length $kn$ that are in $A_n$ is $\rho_{A_n}(kn) = \sum_{i=1}^{n} S(kn+1, i)$. We have the following proposition.

Proposition 1. For all $n, k \ge 1$, $B_k(n) \le \sum_{i=1}^{n} S(kn+1, i)$.

For $n = 3$ and $k = 2$, $B_2(3) \le 365$. For $k = 2$, Bassino and Nicaud [BN] presented a better upper bound, namely that $B_2(n) \le n\,S(2n, n)$.
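Both bounds can be evaluated with the standard recurrence $S(n, i) = i\,S(n-1, i) + S(n-1, i-1)$ for Stirling numbers of the second kind. The following sketch uses hypothetical function names (`stirling2`, `r1_upper_bound`) and reproduces the value 365 quoted above.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, i):
    """Stirling number of the second kind S(n, i)."""
    if n == i:
        return 1
    if i == 0 or i > n:
        return 0
    return i * stirling2(n - 1, i) + stirling2(n - 1, i - 1)


def r1_upper_bound(k, n):
    """Bound of Proposition 1: number of length-kn strings satisfying R1."""
    return sum(stirling2(k * n + 1, i) for i in range(1, n + 1))


print(r1_upper_bound(2, 3))   # 365
print(3 * stirling2(6, 3))    # 270, the bound n*S(2n, n) for n = 3
```

For $n = 3$ and $k = 2$ the exact value is 216, so both bounds hold and the second is indeed tighter.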

Now let us consider only the rule R2. This rule can be formulated as

$$\bigwedge_{m=1}^{n-1} \bigvee_{j=0}^{km-1} S_j = m. \tag{8}$$

From this formula it is easy to see that the strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ and satisfying only rule R2 can be represented by the regular expression

$$\bigcap_{m=1}^{n-1} \sum_{j=0}^{km-1} (0 + \cdots + (m-1))^{j}\, m\, (0 + \cdots + (n-1))^{kn-j-1}, \tag{9}$$

where we extended the operators of regular expressions to intersection.

Now, in order to simultaneously satisfy rules R1 and R2 in formula (9), the first occurrence of $m - 1$ must precede that of $m$, for $2 \le m \le n-1$. These positions are exactly the sequence $(f_j)_{0<j<n}$ defined in Section 4. Given these positions and considering the corresponding sequence $(b_j)_{0<j<n}$, we obtain the regular expression

$$\Big(\prod_{j=1}^{n-1} (0 + \cdots + (j-1))^{b_j - 1}\, j\Big)(0 + \cdots + (n-1))^{b_n - 1}$$

and we must consider the possible values of $(b_j)_{0<j<n}$, constrained to G1 and G2:

$$\sum_{b_1=1}^{k}\ \sum_{b_2=1}^{2k-b_1} \cdots \sum_{b_{n-1}=1}^{k(n-1)-\sum_{i=1}^{n-2} b_i} \Big(\prod_{j=1}^{n-1} (0 + \cdots + (j-1))^{b_j - 1}\, j\Big)(0 + \cdots + (n-1))^{b_n - 1}.$$

For $n = 3$ and $k = 2$ we have

$$(01 + 1(0+1))((0+1)2 + 2(0+1+2))(0+1+2)^2 + 12(0+1+2)^4$$

and the number of these strings is $(1 + 2)((2 + 3)3^2) + 3^4 = 216$.

For each sequence $(b_j)_{0<j<n}$ the number of strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ and satisfying R1 and R2 is

$$\prod_{i=1}^{n} i^{\,b_i - 1}, \tag{10}$$

a direct consequence of rules G3, G4 and G5. We must then sum over all $b_j$ constrained to rules G1 and G2.

Theorem 2. We have

$$B_k(n) = \sum_{b_1=1}^{k}\ \sum_{b_2=1}^{2k-b_1}\ \sum_{b_3=1}^{3k-b_1-b_2} \cdots \sum_{b_{n-1}=1}^{k(n-1)-\sum_{i=1}^{n-2} b_i}\ \prod_{i=1}^{n} i^{\,b_i - 1}. \tag{11}$$
Proof. It is an immediate consequence of rules G1 to G5. □

Corollary 1. The number of non-isomorphic ICDFA's with $n$ states over an alphabet of $k$ symbols is $2^n B_k(n)$.

Proof. By Theorems 1 and 2 and considering the possible sets of final states. □
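The summation of Theorem 2 can be evaluated directly by nesting the sums recursively. The sketch below is one possible reading, under the assumption that $b_1$ ranges over $1, \ldots, k$ (that is, $b_1$ is the position of the first 1 counted from 1) and that $b_n = kn + 1 - \sum_{i<n} b_i$; the function names `B` and `icdfa_count` are hypothetical. For small parameters the result agrees with the values 12 and 216 obtained above.

```python
def B(k, n):
    """Evaluate the positive summation of Theorem 2."""
    def total(j, used, weight):
        # j: index of the next b_j to choose
        # used: b_1 + ... + b_{j-1}
        # weight: product of i**(b_i - 1) for i < j
        if j == n:
            b_n = k * n + 1 - used          # b_n is determined by the others
            return weight * n ** (b_n - 1)
        return sum(
            total(j + 1, used + b, weight * j ** (b - 1))
            for b in range(1, k * j - used + 1)
        )
    return total(1, 0, 1)


def icdfa_count(k, n):
    """Corollary 1: every structure admits 2**n sets of final states."""
    return 2 ** n * B(k, n)


print([B(2, n) for n in range(1, 5)])  # [1, 12, 216, 5248]
print(icdfa_count(2, 3))               # 1728
```

Unlike the subtractive recurrence, every term added here is positive, and each term counts the strings sharing one sequence of first-occurrence positions.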

6 Conclusion 

The method described in Section 4 was implemented and used to generate all ICDFA$_\emptyset$'s for $k = 2$ and $n < 10$, and for $k = 3$ and $n < 7$. The time complexity of the program is linear in the number of automata, and it took about a week to generate all these ICDFA$_\emptyset$'s on a PPC G4 at 1.5 GHz.

One of the advantages of this method is that only the allowed strings are computed, so it is not a generate-and-test algorithm. Moreover, because automata are generated in lexicographic order, it is easy to generate them as needed for consumption by another algorithm.

If an ICDFA with $n$ states accepts a finite language then there exists a topological order of its states such that $\delta(i, \sigma) > i$, for all $i < n - 1$ and $\sigma \in \Sigma$. But the order we used for string representations is not a topological order. So we cannot determine directly from the string whether the accepted language is finite, as was done by Domaratzki [Dom04] only for finite languages. Although the formula $B_k(n)$ is quite similar to the one obtained in [Dom04] for an upper bound of the number of finite languages, the meanings of the parameters $(b_j)$ are not directly related.